Homework 03

Classification problems

Due date : 05 April 23:59

BDA - University of Tartu - Spring 2020

Homework instructions

  • Insert your team members' names and student IDs in the "Team mates" field below. If you are not working in a team, please insert only your name, surname and student ID

  • The accepted submission formats are Colab links or .ipynb files. If you are submitting a Colab link, please make sure that the privacy settings for the file are set to public so we can access your code.

  • The submission will automatically close at 12:00 am, so please make sure you have enough time to submit the homework.

  • Only one of the teammates should submit the homework. We will grade and give points to both of you!

  • You do not necessarily need to work on Colab. Especially as the size and complexity of the datasets increase throughout the course, you can install Jupyter notebooks locally and work from there.

  • If you do not understand what a question is asking for, please ask in Moodle.

Team mates:

Name Surname: Enlik - Student ID: B96323

Name Surname: Thi Thuy Nga Vu - Student ID: B88416

1. Classification tasks and algorithms (8 points)

We are going to use the dataset from the file HR_Employee_Attrition.csv, which contains data about the employees of a company and whether they have left the company due to causes such as retirement, resignation, elimination of a position, personal health, etc. It is important for companies to predict whether their employees are going to leave, because the hiring process is costly and requires planning. The data has the following columns:

Age – self descriptive

BusinessTravel – how frequent employee travels

DailyRate – daily rate in terms of salary

Department – self descriptive

DistanceFromHome – distance between employee home and work

Education – education level of employee

EducationField – self descriptive

EnvironmentSatisfaction – level of satisfaction with working environment

Gender – self descriptive

HourlyRate – self descriptive

JobRole – self descriptive

JobInvolvement – level of involvement in the job

JobSatisfaction – level of satisfaction with current job

MaritalStatus – self descriptive

MonthlyIncome – self descriptive

MonthlyRate – self descriptive

NumCompaniesWorked – self descriptive

Over18 – whether the employee is over 18 years old

OverTime – whether the employee works overtime or not

PerformanceRating – performance level of employee

RelationshipSatisfaction – level of satisfaction with working community

StandardHours – standard amount of hours that employee works

TotalWorkingYears – total number of years the employee has worked

TrainingTimesLastYear – number of times the employee was trained last year

In [1]:
# !conda install matplotlib
In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
In [3]:
hr_data = pd.read_csv('HR_Employee_Attrition.csv', header=0)
hr_data.head(10)
Out[3]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender ... PerformanceRating RelationshipSatisfaction StandardHours TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 2 Female ... 3 1 80 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 3 Male ... 4 4 80 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 4 Male ... 3 2 80 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 4 Female ... 3 3 80 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 Male ... 3 4 80 6 3 3 2 2 2 2
5 32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 4 Male ... 3 3 80 8 2 2 7 7 3 6
6 59 No Travel_Rarely 1324 Research & Development 3 3 Medical 3 Female ... 4 1 80 12 3 2 1 0 0 0
7 30 No Travel_Rarely 1358 Research & Development 24 1 Life Sciences 4 Male ... 4 2 80 1 2 3 1 0 0 0
8 38 No Travel_Frequently 216 Research & Development 23 3 Life Sciences 4 Male ... 4 2 80 10 2 3 9 7 1 8
9 36 No Travel_Rarely 1299 Research & Development 27 3 Medical 3 Male ... 3 2 80 17 3 2 7 7 7 7

10 rows × 30 columns

In [4]:
### Data Exploration

## equivalent to dplyr.glimpse() function in R language
# hr_data.info()

## Numerical features overview
# hr_data.describe()

## Plot data distribution
# hr_data.hist(figsize=(20,20))
# plt.show()

## Check EducationField Distribution
# hr_data.EducationField.value_counts()

## Check Gender Distribution
# hr_data.Gender.value_counts()

## Check Attrition Distribution
# hr_data.Attrition.value_counts()
# print('Percentage ex-employees is {:.2f}% and current employees is {:.2f}%'.format(
#     hr_data[hr_data['Attrition'] == 'Yes'].shape[0] / hr_data.shape[0]*100,
#     hr_data[hr_data['Attrition'] == 'No'].shape[0] / hr_data.shape[0]*100))

1.1 Dataset exploration (1.6 points)

1.1.0. Plot the correlation of the variables in the dataset with the Attrition variable. (0.4 points)

In [5]:
hr_data.Attrition.dtypes
Out[5]:
dtype('O')
In [6]:
# Pre-processing before computing the correlation values
hr_data.Attrition.replace(to_replace='Yes', value = 1, inplace = True)
hr_data.Attrition.replace(to_replace='No', value = 0, inplace = True)
hr_data.isnull().sum()

hr_data_dummies = pd.get_dummies(hr_data)
hr_data_dummies.head()
Out[6]:
Age Attrition DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome ... JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Over18_Y OverTime_No OverTime_Yes
0 41 1 1102 1 2 2 94 3 4 5993 ... 0 0 1 0 0 0 1 1 0 1
1 49 0 279 8 1 3 61 2 2 5130 ... 0 1 0 0 0 1 0 1 1 0
2 37 1 1373 2 2 4 92 2 3 2090 ... 0 0 0 0 0 0 1 1 0 1
3 33 0 1392 3 4 4 56 3 3 2909 ... 0 1 0 0 0 1 0 1 0 1
4 27 0 591 2 1 1 40 3 2 3468 ... 0 0 0 0 0 1 0 1 1 0

5 rows × 51 columns

In [7]:
hr_data_dummies
Out[7]:
Age Attrition DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome ... JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Over18_Y OverTime_No OverTime_Yes
0 41 1 1102 1 2 2 94 3 4 5993 ... 0 0 1 0 0 0 1 1 0 1
1 49 0 279 8 1 3 61 2 2 5130 ... 0 1 0 0 0 1 0 1 1 0
2 37 1 1373 2 2 4 92 2 3 2090 ... 0 0 0 0 0 0 1 1 0 1
3 33 0 1392 3 4 4 56 3 3 2909 ... 0 1 0 0 0 1 0 1 0 1
4 27 0 591 2 1 1 40 3 2 3468 ... 0 0 0 0 0 1 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1465 36 0 884 23 2 3 41 4 4 2571 ... 0 0 0 0 0 1 0 1 1 0
1466 39 0 613 6 1 4 42 2 1 9991 ... 0 0 0 0 0 1 0 1 1 0
1467 27 0 155 4 3 2 87 4 2 6142 ... 0 0 0 0 0 1 0 1 0 1
1468 49 0 1023 2 3 4 63 2 2 5390 ... 0 0 1 0 0 1 0 1 1 0
1469 34 0 628 8 3 2 82 4 3 4404 ... 0 0 0 0 0 1 0 1 1 0

1470 rows × 51 columns

In [8]:
import matplotlib.cm as cm
# from sklearn.preprocessing import normalize

# Get correlation of "Attrition" with the other variables
plt.figure(figsize=(15,8))
plt.style.use('ggplot')

# https://stackoverflow.com/questions/47302343/what-names-can-be-used-in-plt-cm-get-cmap
# Get a color map
my_cmap = cm.get_cmap('Accent')

correlations = hr_data_dummies.corr()['Attrition'].sort_values(ascending = False)
correlations.plot(kind='bar', cmap=my_cmap)

# set titles for figure, x, y
plt.title('Correlation with Attrition',fontsize=20)
plt.xlabel('Variable', fontsize=20)
plt.ylabel('Correlation coefficient',fontsize=20)

plt.xticks(fontsize = 10) 
plt.yticks(fontsize = 20) 
plt.grid(True)
plt.show() 

print('Most Positive Correlations: \n', correlations.head(5))
print('\nMost Negative Correlations: \n', correlations.tail(5))
Most Positive Correlations: 
 Attrition                           1.000000
OverTime_Yes                        0.246118
MaritalStatus_Single                0.175419
JobRole_Sales Representative        0.157234
BusinessTravel_Travel_Frequently    0.115143
Name: Attrition, dtype: float64

Most Negative Correlations: 
 YearsInCurrentRole   -0.160545
TotalWorkingYears    -0.171063
OverTime_No          -0.246118
StandardHours              NaN
Over18_Y                   NaN
Name: Attrition, dtype: float64

1.1.1. Write three interesting observations that you notice. Were they as you expected? Please elaborate your answer in 1 - 3 sentences. (0.4 points)

Answer 1: Employees who work overtime are the most likely to leave the company; OverTime_Yes has the highest positive correlation with Attrition.

Answer 2: Female employees are less likely to leave the company compared to male employees.

Answer 3: All the employees are older than 18 and have the same standard working hours. Since these two variables, StandardHours and Over18, are constant, their correlation with Attrition is undefined (NaN), so we can drop them from our model.
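As a quick sanity check (a sketch on a toy frame, not the notebook's real data), constant columns like StandardHours and Over18 can also be detected programmatically before computing correlations:

```python
import pandas as pd

# Tiny toy frame: StandardHours is constant, so its variance is zero and
# its correlation with anything is NaN.
df = pd.DataFrame({
    'Attrition': [1, 0, 1, 0],
    'StandardHours': [80, 80, 80, 80],
    'TotalWorkingYears': [8, 10, 7, 8],
})

# Columns with a single unique value can be found and dropped in one step.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['StandardHours']
```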

1.1.2 Make a boxplot for total working years for each type of Attrition values. (0.4 points)

In [9]:
import seaborn as sns

sns.boxplot(x = hr_data.Attrition, y = hr_data.TotalWorkingYears)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bf5f4a8>
  • 0 = No
  • 1 = Yes

1.1.3. Plot the relative frequency of Attrition values (Yes/No) (0.4 points)

In [10]:
hr_data['Attrition'].value_counts()
Out[10]:
0    1233
1     237
Name: Attrition, dtype: int64
In [11]:
# !pip install cufflinks
In [12]:
hr_data.Attrition.replace(to_replace=1, value = 'Yes', inplace = True)
hr_data.Attrition.replace(to_replace=0, value = 'No', inplace = True)

fig = plt.figure(figsize=[8,4])

ax = plt.subplot()
ax.hist(hr_data['Attrition'], 
         width = 0.2,
         weights = np.ones(len(hr_data['Attrition'])) / len(hr_data['Attrition']) 
        )
ax.set_title('Attrition - Distribution in Relative Frequency')
# ax.x
plt.show()

## Using Plotly 
# # Standard plotly imports
# import plotly as py
# import plotly.figure_factory as ff
# import plotly.graph_objs as go
# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# # Using plotly + cufflinks in offline mode
# import cufflinks as cf
# cf.set_config_file(offline=True)
# import cufflinks
# cufflinks.go_offline(connected=True)

# hr_data['Attrition'].iplot(kind='hist', 
#                            title='Attrition - Distribution in Relative Frequency',
#                            xTitle='Attrition',
#                            yTitle='Frequency')
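A simpler alternative sketch for the same plot (on a toy series with the dataset's 1233/237 counts): value_counts(normalize=True) gives the relative frequencies directly, ready for a bar plot:

```python
import pandas as pd

# Toy stand-in for hr_data['Attrition'] with the real counts (1233 No / 237 Yes).
# value_counts(normalize=True) returns relative frequencies directly,
# which can then be plotted with .plot(kind='bar').
attrition = pd.Series(['No'] * 1233 + ['Yes'] * 237)

rel_freq = attrition.value_counts(normalize=True)
print(rel_freq.round(4))  # No 0.8388, Yes 0.1612
```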

1.2 Classification (6.4 points)

We are going to predict the variable Attrition by trying different classification algorithms and comparing them. First, let's split the data into training and test sets. Hint: You can apply some preprocessing as well to get better results.

In [13]:
# pre-processing
# remove StandardHours and Over18_Y because there is no correlation with Attrition
hr_data_dummies = hr_data_dummies.drop(['StandardHours', 'Over18_Y'], axis =1)
hr_data_dummies.head()
Out[13]:
Age Attrition DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome ... JobRole_Manufacturing Director JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single OverTime_No OverTime_Yes
0 41 1 1102 1 2 2 94 3 4 5993 ... 0 0 0 1 0 0 0 1 0 1
1 49 0 279 8 1 3 61 2 2 5130 ... 0 0 1 0 0 0 1 0 1 0
2 37 1 1373 2 2 4 92 2 3 2090 ... 0 0 0 0 0 0 0 1 0 1
3 33 0 1392 3 4 4 56 3 3 2909 ... 0 0 1 0 0 0 1 0 0 1
4 27 0 591 2 1 1 40 3 2 3468 ... 0 0 0 0 0 0 1 0 1 0

5 rows × 49 columns

In [14]:
hr_data_dummies.columns
Out[14]:
Index(['Age', 'Attrition', 'DailyRate', 'DistanceFromHome', 'Education',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement',
       'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PerformanceRating', 'RelationshipSatisfaction', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
       'BusinessTravel_Non-Travel', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Department_Human Resources',
       'Department_Research & Development', 'Department_Sales',
       'EducationField_Human Resources', 'EducationField_Life Sciences',
       'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical Degree',
       'Gender_Female', 'Gender_Male', 'JobRole_Healthcare Representative',
       'JobRole_Human Resources', 'JobRole_Laboratory Technician',
       'JobRole_Manager', 'JobRole_Manufacturing Director',
       'JobRole_Research Director', 'JobRole_Research Scientist',
       'JobRole_Sales Executive', 'JobRole_Sales Representative',
       'MaritalStatus_Divorced', 'MaritalStatus_Married',
       'MaritalStatus_Single', 'OverTime_No', 'OverTime_Yes'],
      dtype='object')
In [15]:
hr_data_dummies
Out[15]:
Age Attrition DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome ... JobRole_Manufacturing Director JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single OverTime_No OverTime_Yes
0 41 1 1102 1 2 2 94 3 4 5993 ... 0 0 0 1 0 0 0 1 0 1
1 49 0 279 8 1 3 61 2 2 5130 ... 0 0 1 0 0 0 1 0 1 0
2 37 1 1373 2 2 4 92 2 3 2090 ... 0 0 0 0 0 0 0 1 0 1
3 33 0 1392 3 4 4 56 3 3 2909 ... 0 0 1 0 0 0 1 0 0 1
4 27 0 591 2 1 1 40 3 2 3468 ... 0 0 0 0 0 0 1 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1465 36 0 884 23 2 3 41 4 4 2571 ... 0 0 0 0 0 0 1 0 1 0
1466 39 0 613 6 1 4 42 2 1 9991 ... 0 0 0 0 0 0 1 0 1 0
1467 27 0 155 4 3 2 87 4 2 6142 ... 1 0 0 0 0 0 1 0 0 1
1468 49 0 1023 2 3 4 63 2 2 5390 ... 0 0 0 1 0 0 1 0 1 0
1469 34 0 628 8 3 2 82 4 3 4404 ... 0 0 0 0 0 0 1 0 1 0

1470 rows × 49 columns

In [16]:
# Import Library
from sklearn.model_selection import train_test_split # for data splitting
In [17]:
X = hr_data_dummies.drop(columns = ['Attrition'])
y = hr_data_dummies.Attrition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
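Note (an optional refinement, not required by the task): since Attrition is imbalanced, passing stratify=y to train_test_split keeps the class ratio nearly identical across splits. A self-contained sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the Attrition ratio (84% negative, 16% positive).
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print('positive share in test fold:', y_te.mean())
```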
In [18]:
X_test
Out[18]:
Age DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome MonthlyRate ... JobRole_Manufacturing Director JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single OverTime_No OverTime_Yes
442 36 635 10 4 2 32 3 4 9980 15318 ... 0 0 0 1 0 0 0 1 1 0
1091 33 575 25 3 4 44 2 2 4320 24152 ... 1 0 0 0 0 0 0 1 1 0
981 35 662 18 4 4 67 3 3 4614 23288 ... 0 0 0 1 0 0 1 0 0 1
785 40 1492 20 4 1 61 3 4 10322 26542 ... 0 0 0 0 0 0 1 0 1 0
1332 29 459 24 2 4 73 2 4 2439 14753 ... 0 0 1 0 0 0 0 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1439 36 557 3 3 1 94 2 4 7644 12695 ... 0 0 0 1 0 0 1 0 1 0
481 34 254 1 2 2 83 2 4 3622 22794 ... 0 0 1 0 0 0 1 0 0 1
124 31 249 6 4 2 76 1 3 6172 20739 ... 0 0 0 1 0 0 1 0 0 1
198 38 1261 2 4 4 88 3 3 6553 7259 ... 1 0 0 0 0 0 1 0 1 0
1229 40 369 8 2 2 92 3 1 6516 5041 ... 1 0 0 0 0 0 1 0 0 1

294 rows × 48 columns

1.2.1 Use the scikit-learn DecisionTreeClassifier with default parameters to predict the attrition value for the test set. Set the random seed to 0. Calculate the accuracy score and print it. (0.4 points)

In [19]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn import metrics

modelDT = DT()
modelDT = modelDT.fit(X_train, y_train)
preds_DT = modelDT.predict(X_test)
print('Accuracy of DecisionTreeClassifier on test set: ', metrics.accuracy_score(y_test, preds_DT))
Accuracy of DecisionTreeClassifier on test set:  0.7789115646258503

1.2.2 Plot the confusion matrix for the predicted values. Based on this matrix or your general knowledge, why is accuracy not a good metric to use in this case? (0.4 points)

In [20]:
confusion_matrix_DT = metrics.confusion_matrix(y_test, preds_DT)
print("\n Confusion Matrix")
print(confusion_matrix_DT)
 Confusion Matrix
[[214  31]
 [ 34  15]]
In [21]:
from mlxtend.plotting import plot_confusion_matrix
confusion_matrix_DT = metrics.confusion_matrix(y_test, preds_DT)

binary = confusion_matrix_DT

fig, ax = plot_confusion_matrix(conf_mat=binary)
plt.show()
In [22]:
metrics.precision_score(y_test, preds_DT)
Out[22]:
0.32608695652173914
In [23]:
metrics.recall_score(y_test, preds_DT)
Out[23]:
0.30612244897959184

Accuracy is not a good metric in this case because the dataset HR_Employee_Attrition.csv has an imbalanced target variable, Attrition: 83.88% of the examples are Attrition = No and only 16.12% are Attrition = Yes.

In [24]:
print('Percentage of [Attrition = No] is {:.2f}% and [Attritrion = Yes] is {:.2f}%'.format(
    hr_data[hr_data['Attrition'] == 'No'].shape[0] / hr_data.shape[0]*100,
    hr_data[hr_data['Attrition'] == 'Yes'].shape[0] / hr_data.shape[0]*100))
Percentage of [Attrition = No] is 83.88% and [Attritrion = Yes] is 16.12%
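The point above can be demonstrated with a sketch: a trivial "always No" predictor on toy labels with an 84/16 split gets high plain accuracy but only chance-level balanced accuracy:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels with roughly the Attrition ratio; the "model" always predicts 0 (No).
y_true = [0] * 84 + [1] * 16
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))           # 0.84
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```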

1.2.3 We want to use a dummy model (not a machine learning approach) to get 83.88% accuracy. Considering the label ratios, what would this model look like? (0.4 points)

In [25]:
# from sklearn.dummy import DummyClassifier as DC

# # X_train_dummy, X_test_dummy, y_train_dummy, y_test_dummy = train_test_split(X, y, test_size=0.1, random_state=0)

# # modelDC = DC()
# # modelDC.fit(X_train_dummy, y_train_dummy)
# # preds_DC = modelDC.predict(X_test_dummy)

# # print('Accuracy of Dummy Classifier on test set: ', metrics.accuracy_score(y_test_dummy, preds_DC))

# dummy_model_data = pd.read_csv('HR_Employee_Attrition.csv', header=0)
# dummy_model_data['Employee_Leave'] = dummy_model_data['Attrition'].map({'Yes':1})
# dummy_model_data.head()

# # dummy_model_data
In [26]:
print('Percentage of [Attrition = No] is {:.2f}% and [Attritrion = Yes] is {:.2f}%'.format(
    hr_data[hr_data['Attrition'] == 'No'].shape[0] / hr_data.shape[0]*100,
    hr_data[hr_data['Attrition'] == 'Yes'].shape[0] / hr_data.shape[0]*100))
Percentage of [Attrition = No] is 83.88% and [Attritrion = Yes] is 16.12%

Answer 1:

Our model would look like this:

  • We use only the label distribution of Attrition and do not take into consideration other factors such as Age, MonthlyIncome, and the rest
  • That means our model always predicts the majority class, Attrition = 'No', for every employee
  • Since 83.88% of the employees did not leave, this makes our model's accuracy equal to 83.88%
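scikit-learn ships exactly this baseline as DummyClassifier(strategy='most_frequent'); a minimal sketch reproducing the 83.88% figure on toy labels with the real counts:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Sketch of the baseline described above: always predict the majority
# class ('No'), ignoring every feature. Toy labels with the real counts.
y = np.array(['No'] * 1233 + ['Yes'] * 237)
X = np.zeros((len(y), 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
print(dummy.score(X, y))  # 1233/1470, about 0.8388
```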

1.2.4 It is possible to plot the decision tree by using different plotting libraries. We are using https://pypi.org/project/graphviz/ and sklearn.tree. Install the package and complete the code below so that you get a visualisation of our decision tree. (0.4 points)

In [27]:
# !pip install graphviz
# !conda install graphviz

# install graphviz on macOS
# https://graphviz.gitlab.io/download/
In [28]:
from sklearn.tree import export_graphviz
import graphviz
from os import system

# feature_names = list(X.columns)

# export_graphviz writes the .dot file directly; with out_file set it returns None
export_graphviz(modelDT, out_file='decisionTree.dot', feature_names=list(X.columns),
                class_names=True, filled=True, rounded=True,
                special_characters=False)

# Reference
# https://gist.github.com/WillKoehrsen/ff77f5f308362819805a3defd9495ffd

# Convert to png using system command (requires Graphviz)
system("dot -Tpng decisionTree.dot -o decisionTree.png")

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'decisionTree.png')
Out[28]:
In [29]:
# Optimizing Decision Tree Performance
# Reference: https://www.datacamp.com/community/tutorials/decision-tree-classification-python

# Create Decision Tree classifer object
modelDT_optimized = DT(criterion="entropy", max_depth=3)

# Train Decision Tree Classifer
modelDT_optimized = modelDT_optimized.fit(X_train,y_train)

#Predict the response for test dataset
preds_DT_optimized = modelDT_optimized.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, preds_DT_optimized))
Accuracy: 0.8469387755102041

The classification rate increased to 84.7%, which is better accuracy than the previous model (77.9%)

In [30]:
export_graphviz(modelDT_optimized, out_file='decisionTree_optimized.dot', feature_names=list(X.columns),
                class_names=True, filled=True, rounded=True,
                special_characters=False)
# Convert to png using system command (requires Graphviz)
system("dot -Tpng decisionTree_optimized.dot -o decisionTree_optimized.png")

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'decisionTree_optimized.png')
Out[30]:

This optimized model (pre-pruned by limiting max_depth) is less complex, more explainable, and easier to understand

1.2.5 For the decision tree we modeled, what is the most important factor to decide if an employee is going to leave or not? (0.4 points)

In [31]:
# Reference about Feature Importances
# https://towardsdatascience.com/understanding-decision-trees-for-classification-python-9663d683c952

# Non-Optimized DT Model
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(modelDT.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances.head()
Out[31]:
feature importance
8 MonthlyIncome 0.084
13 TotalWorkingYears 0.079
5 HourlyRate 0.061
1 DailyRate 0.058
2 DistanceFromHome 0.053
In [32]:
# Optimized DT Model
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(modelDT_optimized.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances.head()
Out[32]:
feature importance
8 MonthlyIncome 0.315
47 OverTime_Yes 0.280
16 YearsAtCompany 0.119
41 JobRole_Sales Executive 0.093
0 Age 0.074

Answer 1: MonthlyIncome. It has the highest feature importance in both the default and the depth-limited decision tree.
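As a cross-check (optional, not required by the question): impurity-based feature_importances_ can favour high-cardinality numeric columns such as MonthlyIncome, so permutation importance is a common alternative. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the block runs standalone.
X, y = make_classification(n_samples=300, n_informative=3, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Shuffle each feature and measure the accuracy drop (averaged over repeats).
result = permutation_importance(tree, X, y, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:3]
print('top 3 feature indices by permutation importance:', top)
```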

1.2.6 Plot the classification report for the decision tree. In this case study, which one out of precision and recall would you consider more important? Please elaborate your answer. (0.4 points)

Answer 1:

Precision is more important, because the model is more useful when the employees it flags as leaving really are going to leave; then we can plan to recruit new people or promote/motivate them to stay.

In [33]:
# print("- Precision: \t {0:.2f}".format(metrics.precision_score(y_test, preds_DT_optimized)))
# print("- Recall: \t {0:.2f}".format(metrics.recall_score(y_test, preds_DT_optimized)))

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_precision_recall_curve
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score
from sklearn.metrics import classification_report

# # average_precision = average_precision_score(y_test, preds_DT_optimized)
# average_precision = average_precision_score(y_test, preds_DT)

# print('Average precision-recall score: {0:0.2f}'.format(
#       average_precision))

# # disp = plot_precision_recall_curve(modelDT_optimized, X_test, y_test,label="AP= "+str(average_precision))
# disp = plot_precision_recall_curve(modelDT, X_test, y_test,label="AP= "+str(average_precision))
# disp.ax_.set_title('2-class Precision-Recall curve: '
#                    'AP={0:0.2f}'.format(average_precision))
# plt.legend()


# the classes are sorted as [0, 1] = [No, Yes], so target_names must follow that order
print(classification_report(y_test, preds_DT, target_names=['No','Yes']))
print(classification_report(y_test, preds_DT_optimized, target_names=['No','Yes']))
              precision    recall  f1-score   support

          No       0.86      0.87      0.87       245
         Yes       0.33      0.31      0.32        49

    accuracy                           0.78       294
   macro avg       0.59      0.59      0.59       294
weighted avg       0.77      0.78      0.78       294

              precision    recall  f1-score   support

          No       0.86      0.98      0.91       245
         Yes       0.64      0.18      0.29        49

    accuracy                           0.85       294
   macro avg       0.75      0.58      0.60       294
weighted avg       0.82      0.85      0.81       294

1.2.7 Calculate the F1 score of the model on the training data and compare it with the F1 score on the test data. What effect is happening? (0.4 points)

In [34]:
from sklearn.metrics import f1_score

# From Lab 06
# print(f"F1_score macro: {f1_score(y_test, preds_DT_optimized, average='macro')}")
# print(f"\nF1_score micro: {f1_score(y_test, preds_DT_optimized, average='micro')}")
# print(f"\nF1_score weighted: {f1_score(y_test, preds_DT_optimized, average='weighted')}")

# print(f"\nF1_score test data: {f1_score(y_test, preds_DT_optimized)}")


# F1-score for training data
pred_DT_train = modelDT.predict(X_train)
print(f"\nF1_score on train data: {f1_score(y_train, pred_DT_train)}")

# F1-score for test data
# print(f"F1_score macro: {f1_score(y_test, preds_DT, average='macro')}")
# print(f"\nF1_score micro: {f1_score(y_test, preds_DT, average='micro')}")
# print(f"\nF1_score weighted: {f1_score(y_test, preds_DT, average='weighted')}")

print(f"\nF1_score on test data: {f1_score(y_test, preds_DT)}")
F1_score on train data: 1.0

F1_score on test data: 0.3157894736842105

Answer 1:

  • The F1-score on the training data (1.0) is much higher than on the test data (about 0.32). This is overfitting: the unrestricted tree was fit on the training data and memorizes it perfectly (hence the perfect F1-score), but it generalizes poorly to unseen data

1.2.8 We can use cross validation scores to ensure that our model is generalizing well and we can be more confident when we apply it in test data. We will now try different combinations of maximum depth parameters for the decision tree and choose the best while using cross validation. Please complete the code below and report the best maximum depth. (0.4 points)

In [35]:
from sklearn.model_selection import cross_val_score
In [36]:
best_score = 0
best_depth = 0
for i in range(5,20):
    clf = DT(max_depth=i, random_state = 0)
    # Perform 5-fold cross validation. 
    # The number of folds you want to use generally depends from the size of data
    scores = cross_val_score(estimator= clf, scoring="f1", X=X, y=y, cv=5)
    mean_score = scores.mean()

    if mean_score > best_score:
        best_score = mean_score
        best_depth = i

    print('Mean score ', mean_score)

print('\n The best tree depth is: ', best_depth )
Mean score  0.34492165610020037
Mean score  0.37368237188722164
Mean score  0.3795585465126109
Mean score  0.35860890527509925
Mean score  0.3713641682963801
Mean score  0.36311875365619384
Mean score  0.3550931750890437
Mean score  0.37163221680446606
Mean score  0.37582894709659054
Mean score  0.3643644043969586
Mean score  0.34748921416593187
Mean score  0.3500625480504707
Mean score  0.35472079538780943
Mean score  0.35059891329271825
Mean score  0.35059891329271825

 The best tree depth is:  7
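The manual loop above can also be written with GridSearchCV, which runs the same CV-per-depth search and tracks the best parameters itself. A sketch on synthetic data (make_classification is used here only so the block runs standalone):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data so the block runs standalone.
X_toy, y_toy = make_classification(n_samples=300, weights=[0.84], random_state=0)

# Same search as the manual loop: 5-fold CV over max_depth, scored by f1.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': range(5, 20)},
                      scoring='f1', cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
```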

1.2.9 Use SVM with default parameters to classify test data and report accuracy, recall, precision, f1-score and AUC. Set the random_state equal to 0. (0.4 points)

In [37]:
# Data Pre-processing for SVM and Logistic Regression Model
# Scaling all the variables to a range of 0 to 1
from sklearn.preprocessing import MinMaxScaler

features = X.columns.values

scaler = MinMaxScaler(feature_range = (0,1))

scaler.fit(X)

X_scaled = pd.DataFrame(scaler.transform(X))

X_scaled.columns = features

# Create Train & Test Data for SVM and Logistic Regression
X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y, test_size=0.3, random_state=101)
In [38]:
X_scaled.head()
Out[38]:
Age DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome MonthlyRate ... JobRole_Manufacturing Director JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single OverTime_No OverTime_Yes
0 0.547619 0.715820 0.000000 0.25 0.333333 0.914286 0.666667 1.000000 0.262454 0.698053 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0
1 0.738095 0.126700 0.250000 0.00 0.666667 0.442857 0.333333 0.333333 0.217009 0.916001 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
2 0.452381 0.909807 0.035714 0.25 1.000000 0.885714 0.333333 0.666667 0.056925 0.012126 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
3 0.357143 0.923407 0.071429 0.75 1.000000 0.371429 0.666667 0.666667 0.100053 0.845814 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
4 0.214286 0.350036 0.035714 0.00 0.000000 0.142857 0.666667 0.333333 0.129489 0.583738 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

5 rows × 48 columns
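Side note (a best-practice assumption, not part of the graded answer): fitting the scaler on the full X lets test-set minima and maxima leak into training. A leakage-free variant fits the scaler on the training split only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices standing in for the train/test splits.
rng = np.random.RandomState(0)
X_train_raw = rng.uniform(0, 100, size=(5, 2))
X_test_raw = rng.uniform(0, 100, size=(3, 2))

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_s = scaler.fit_transform(X_train_raw)  # fit on the training split only
X_test_s = scaler.transform(X_test_raw)        # reuse the training statistics

print(round(X_train_s.min(), 6), round(X_train_s.max(), 6))  # 0.0 1.0
```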

In [39]:
# y_scaled_train.unique()
# y_scaled_test.unique()
# X_scaled_train.Age.describe()
# X_scaled_test.Age.describe()
In [40]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

model_svm = SVC(random_state=0)
# model_svm = model_svm.fit(X_train, y_train)
# preds_svm = model_svm.predict(X_test)

# print('SVM model Accuracy = ', metrics.accuracy_score(y_test, preds_svm))
# print('SVM model Recall = ', metrics.recall_score(y_test, preds_svm))
# print('SVM model Precision = ', metrics.precision_score(y_test, preds_svm))
# print('SVM model F1 Score = ', metrics.f1_score(y_test, preds_svm))
# print('SVM model AUC = ', metrics.roc_auc_score(y_test, preds_svm))

# print('SVM model Confusion Matrix \n', metrics.confusion_matrix(y_test, preds_svm))

model_svm = model_svm.fit(X_scaled_train, y_scaled_train)
preds_svm = model_svm.predict(X_scaled_test)

print('SVM model Accuracy = {:.2f}'.format(metrics.accuracy_score(y_scaled_test, preds_svm)))
print('SVM model Recall = {:.2f}'.format(metrics.recall_score(y_scaled_test, preds_svm)))
print('SVM model Precision = {:.2f}'.format(metrics.precision_score(y_scaled_test, preds_svm)))
print('SVM model F1 Score = {:.2f}'.format(metrics.f1_score(y_scaled_test, preds_svm)))
print('SVM model AUC = {:.2f}'.format(metrics.roc_auc_score(y_scaled_test, preds_svm)))

print('SVM model Confusion Matrix \n', metrics.confusion_matrix(y_scaled_test, preds_svm))


# fpr, tpr, thresholds = metrics.roc_curve(y_test, preds_svm, pos_label = 2)
# metrics.auc(fpr, tpr)
SVM model Accuracy = 0.87
SVM model Recall = 0.21
SVM model Precision = 0.88
SVM model F1 Score = 0.34
SVM model AUC = 0.60
SVM model Confusion Matrix 
 [[369   2]
 [ 55  15]]
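One option worth trying for the low SVM recall (an assumption about what might help, not a graded requirement): class_weight='balanced' reweights errors on the minority class. A sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data so the block runs standalone.
X, y = make_classification(n_samples=600, weights=[0.84], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

plain = SVC(random_state=0).fit(X_tr, y_tr)
weighted = SVC(random_state=0, class_weight='balanced').fit(X_tr, y_tr)

print('plain recall   :', recall_score(y_te, plain.predict(X_te)))
print('weighted recall:', recall_score(y_te, weighted.predict(X_te)))
```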

1.2.10 Use Logistic Regression with default parameters to classify test data and report accuracy, recall, precision, f1-score, AUC. Set the random_state equal to 0 (0.4 points)

In [41]:
# Running logistic regression model
from sklearn.linear_model import LogisticRegression as LR
model_LR = LR(random_state=0)
# model_LR = model_LR.fit(X_train, y_train)
# pred_LR = model_LR.predict(X_test)

# print('Accuracy of Logistic Regression classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_test, pred_LR)))
# print('Recall of Logistic Regression classifier on test set: {:.2f}'.format(metrics.recall_score(y_test, pred_LR)))
# print('Precision of Logistic Regression classifier on test set: {:.2f}'.format(metrics.precision_score(y_test, pred_LR)))
# print('F1-Score of Logistic Regression classifier on test set: {:.2f}'.format(metrics.f1_score(y_test, pred_LR)))
# print('AUC Score of Logistic Regression classifier on test set: {:.2f}'.format(metrics.roc_auc_score(y_test, pred_LR)))
# print('Logistic Regression Confusion Matrix = \n', metrics.confusion_matrix(y_test, pred_LR))

model_LR = model_LR.fit(X_scaled_train, y_scaled_train)
pred_LR = model_LR.predict(X_scaled_test)

print('Accuracy of Logistic Regression classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_scaled_test, pred_LR)))
print('Recall of Logistic Regression classifier on test set: {:.2f}'.format(metrics.recall_score(y_scaled_test, pred_LR)))
print('Precision of Logistic Regression classifier on test set: {:.2f}'.format(metrics.precision_score(y_scaled_test, pred_LR)))
print('F1-Score of Logistic Regression classifier on test set: {:.2f}'.format(metrics.f1_score(y_scaled_test, pred_LR)))
print('AUC Score of Logistic Regression classifier on test set: {:.2f}'.format(metrics.roc_auc_score(y_scaled_test, pred_LR)))
print('Logistic Regression Confusion Matrix \n', metrics.confusion_matrix(y_scaled_test, pred_LR))
Accuracy of Logistic Regression classifier on test set: 0.88
Recall of Logistic Regression classifier on test set: 0.37
Precision of Logistic Regression classifier on test set: 0.74
F1-Score of Logistic Regression classifier on test set: 0.50
AUC Score of Logistic Regression classifier on test set: 0.67
Logistic Regression Confusion Matrix 
 [[362   9]
 [ 44  26]]

1.2.11 One of the parameters for the Logistic regression is tol which sets the tolerance for the stopping criteria. We are going to calculate the log loss metric for different values of tol. Please fill in the code below and plot the log loss values. Which one of tol values is better for our model based on log loss? (0.4 points)

In [42]:
from sklearn.linear_model import LogisticRegression as LR

# Tolerance values for the stopping criterion (defined once and reused for the plot)
tol_values = [0.9, 0.5, 0.1, 0.001, 0.0001, 0.000001]

log_loss = []
for tol in tol_values:
    lr = LR(tol=tol, random_state=0)
    lr.fit(X_scaled_train, y_scaled_train)

    # log_loss expects predicted probabilities, not hard class labels
    probs = lr.predict_proba(X_scaled_test)
    log_loss.append(metrics.log_loss(y_scaled_test, probs))

# Plot reference:
# https://stackoverflow.com/questions/44813601/how-to-set-x-axis-values-in-matplotlib-python
x_axis = list(range(len(tol_values)))

plt.plot(x_axis, log_loss, marker='o', linestyle='--', color='r')
plt.xlabel('Tol Values')
plt.ylabel('Log Loss Values')
plt.xticks(x_axis, tol_values)
plt.title('Log Loss Values on Different Values of Tol')
plt.show()

Answer 1: tol = 0.5 gives the lowest log loss in the plot, so it is the best of the tested values for our model.

Some theory about log loss:

  • The goal of our machine-learning models is to minimize this value
  • A perfect model would have a log loss of 0
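To make this concrete, the metric can be computed by hand on made-up labels and probabilities and checked against `metrics.log_loss` (a small sanity-check sketch, not part of the assignment):

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 1, 0])
p_pos = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log loss by its definition: -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# sklearn's implementation on the same inputs gives the same value
sk = metrics.log_loss(y_true, p_pos)
print(round(float(manual), 4), round(float(sk), 4))
```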

1.2.12 Use Random Forest with default parameters to classify test data and report accuracy, recall, precision and f1-score and AUC. Set the random_state equal to 0. Please build as well a classification report separately which shows the metrics for each class. (0.4 points)

In [43]:
from sklearn.ensemble import RandomForestClassifier
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
model_rf = RandomForestClassifier(random_state = 0)
model_rf.fit(X_train, y_train)
pred_rf = model_rf.predict(X_test)

print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_test, pred_rf)))
print('Recall of Random Forest classifier on test set: {:.2f}'.format(metrics.recall_score(y_test, pred_rf)))
print('Precision of Random Forest classifier on test set: {:.2f}'.format(metrics.precision_score(y_test, pred_rf)))
print('F1-Score of Random Forest classifier on test set: {:.2f}'.format(metrics.f1_score(y_test, pred_rf)))
print('AUC Score of Random Forest classifier on test set: {:.2f}'.format(metrics.roc_auc_score(y_test, pred_rf)))
print('Random Forest - Confusion Matrix \n', metrics.confusion_matrix(y_test, pred_rf))
Accuracy of Random Forest classifier on test set: 0.86
Recall of Random Forest classifier on test set: 0.18
Precision of Random Forest classifier on test set: 0.90
F1-Score of Random Forest classifier on test set: 0.31
AUC Score of Random Forest classifier on test set: 0.59
Random Forest - Confusion Matrix 
 [[244   1]
 [ 40   9]]
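The question also asks for a per-class classification report; `metrics.classification_report` produces one directly. A minimal sketch on toy labels (in the notebook, `y_test` and `pred_rf` would be passed instead):

```python
from sklearn import metrics

# Toy ground truth and predictions standing in for y_test / pred_rf
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Precision / recall / f1 / support broken down per class, plus averages
report = metrics.classification_report(y_true, y_pred, target_names=['No', 'Yes'])
print(report)
```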

1.2.13 Get the probabilities for each class from Random Forest model. Threshold the probabilities such that it will output the class No only if the model is 70% or higher confident. In all other cases it will predict the class Yes. (0.4 points)

In [45]:
# Reference:
# https://stackoverflow.com/questions/49785904/how-to-set-threshold-to-scikit-learn-random-forest-model

threshold = 0.7  # confidence threshold of 70%

pred_rf_proba = model_rf.predict_proba(X_test)
# Column 1 holds P(Yes), so this predicts Yes only when P(Yes) >= 0.7.
# Note: the question's wording (No only when the model is at least 70%
# confident in No) would instead be (pred_rf_proba[:, 0] < threshold).astype('int')
pred_rf_proba_th = (pred_rf_proba[:, 1] >= threshold).astype('int')

# print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_test, pred_rf_proba_th)))
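For reference, the question's wording (predict No only when the model is at least 70% confident in No, otherwise Yes) corresponds to thresholding column 0 of `predict_proba`. A sketch on a made-up probability matrix:

```python
import numpy as np

# Made-up predict_proba output: column 0 = P(No), column 1 = P(Yes)
proba = np.array([[0.95, 0.05],
                  [0.72, 0.28],
                  [0.65, 0.35],
                  [0.40, 0.60]])

threshold = 0.7
# Predict No (0) only when P(No) >= 0.7; otherwise fall back to Yes (1)
preds = (proba[:, 0] < threshold).astype(int)
print(preds)  # → [0 0 1 1]: first two rows are confident No, the rest Yes
```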

1.2.14 Build again the classification matrix. Do you think there were some improvements regarding the classification for class Yes ? Explain your answer briefly. (0.4 points)

In [46]:
print('Random Forest with Threshold - Confusion Matrix \n', metrics.confusion_matrix(y_test, pred_rf_proba_th))
Random Forest with Threshold - Confusion Matrix 
 [[245   0]
 [ 48   1]]

Answer 1: Not really for class Yes. Class No is now classified almost perfectly (245 correct, 0 misclassified), but the number of correctly identified Yes cases drops from 9 to 1, because the model now predicts Yes only when it is very confident.

  • The decision threshold can be moved to trade off Recall against Specificity
  • The default threshold for RandomForestClassifier is 0.5, so that is the natural starting point

Reference: Towards Data Science

1.2.15 Use XGBoost with default parameters to classify test data and report accuracy, recall, precision, f1-score and AUC. (0.4 points)

In [47]:
X_test
Out[47]:
Age DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction MonthlyIncome MonthlyRate ... JobRole_Manufacturing Director JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single OverTime_No OverTime_Yes
442 36 635 10 4 2 32 3 4 9980 15318 ... 0 0 0 1 0 0 0 1 1 0
1091 33 575 25 3 4 44 2 2 4320 24152 ... 1 0 0 0 0 0 0 1 1 0
981 35 662 18 4 4 67 3 3 4614 23288 ... 0 0 0 1 0 0 1 0 0 1
785 40 1492 20 4 1 61 3 4 10322 26542 ... 0 0 0 0 0 0 1 0 1 0
1332 29 459 24 2 4 73 2 4 2439 14753 ... 0 0 1 0 0 0 0 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1439 36 557 3 3 1 94 2 4 7644 12695 ... 0 0 0 1 0 0 1 0 1 0
481 34 254 1 2 2 83 2 4 3622 22794 ... 0 0 1 0 0 0 1 0 0 1
124 31 249 6 4 2 76 1 3 6172 20739 ... 0 0 0 1 0 0 1 0 0 1
198 38 1261 2 4 4 88 3 3 6553 7259 ... 1 0 0 0 0 0 1 0 1 0
1229 40 369 8 2 2 92 3 1 6516 5041 ... 1 0 0 0 0 0 1 0 0 1

294 rows × 48 columns

In [48]:
from xgboost import XGBClassifier

# instantiate the classifier with default parameters
model_XGB = XGBClassifier()

# fit data
model_XGB.fit(X_train, y_train)

# predict unseen data
pred_XGB = model_XGB.predict(X_test)

print('Accuracy of XGBoost classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_test, pred_XGB)))
print('Recall of XGBoost classifier on test set: {:.2f}'.format(metrics.recall_score(y_test, pred_XGB)))
print('Precision of XGBoost classifier on test set: {:.2f}'.format(metrics.precision_score(y_test, pred_XGB)))
print('F1-Score of XGBoost classifier on test set: {:.2f}'.format(metrics.f1_score(y_test, pred_XGB)))
print('AUC Score of XGBoost classifier on test set: {:.2f}'.format(metrics.roc_auc_score(y_test, pred_XGB)))
print('XGBoost - Confusion Matrix \n', metrics.confusion_matrix(y_test, pred_XGB))
Accuracy of XGBoost classifier on test set: 0.85
Recall of XGBoost classifier on test set: 0.27
Precision of XGBoost classifier on test set: 0.62
F1-Score of XGBoost classifier on test set: 0.37
AUC Score of XGBoost classifier on test set: 0.62
XGBoost - Confusion Matrix 
 [[237   8]
 [ 36  13]]
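A side note on the AUC values above: feeding hard 0/1 predictions to `roc_auc_score` discards the model's ranking information, so scores from `predict_proba` usually give a higher and more informative AUC. A toy illustration:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])
# Continuous scores (as from predict_proba[:, 1]) rank the classes well...
scores = np.array([0.1, 0.6, 0.7, 0.9])
# ...but thresholding at 0.5 first throws part of that ranking away
hard = (scores >= 0.5).astype(int)

auc_scores = metrics.roc_auc_score(y_true, scores)
auc_hard = metrics.roc_auc_score(y_true, hard)
print(auc_scores, auc_hard)  # → 1.0 0.75
```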

1.2.16 Based on your answer from 1.2.6 and other important evaluation metrics for unbalanced datasets, choose the best classifier and plot its feature importances in decreasing order. Were the 3 most important features as you expected ? Please explain why. (0.4 points)

Answer 1:

  • Based on our models from task 1.2.6, we choose the Random Forest classifier as the best model because it produces the highest precision, 90%
  • The 3 most important features differ from what we expected beforehand (MonthlyIncome, TotalWorkingYears and HourlyRate): our Random Forest classifier ranks MonthlyIncome, Age and DailyRate highest
In [50]:
# 3 most important features
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(model_rf.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False)
importances.head(10)
Out[50]:
feature importance
8 MonthlyIncome 0.079
0 Age 0.068
1 DailyRate 0.058
13 TotalWorkingYears 0.054
5 HourlyRate 0.051
9 MonthlyRate 0.050
16 YearsAtCompany 0.047
2 DistanceFromHome 0.044
19 YearsWithCurrManager 0.034
10 NumCompaniesWorked 0.034
In [51]:
import plotly.express as plex

plex.bar(x=importances.feature, y=importances.importance)

2. Improving classification (2 points)

In this task we will try to improve the performance of the best classifier you selected on 1.2.12 by using several techniques.

2.1 Do you think it is better to try oversampling or downsampling in this case study and why ? (0.4 points)

Answer 1:

We choose Oversampling, because:

  • The amount of collected data is not sufficient for undersampling
  • With undersampling the dataset would shrink to about 470 rows (237 per class), which is too small to fit our classifier reliably


2.2 Apply oversampling to the data while keeping random_state equal to 0. (0.4 points)

In [52]:
# Classes count
count_class_0, count_class_1 = hr_data_dummies.Attrition.value_counts()

# Divide by class
df_class_0 = hr_data_dummies[hr_data_dummies['Attrition'] == 0]
df_class_1 = hr_data_dummies[hr_data_dummies['Attrition'] == 1]
In [53]:
colors = ['#66CDAA', '#6495ED']

# random_state=0 as required by the question
df_class_1_over = df_class_1.sample(count_class_0, replace=True, random_state=0)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_test_over.Attrition.value_counts())

df_test_over.Attrition.value_counts().plot(kind='bar', title='Count (target)', color=colors);
Random over-sampling:
1    1233
0    1233
Name: Attrition, dtype: int64
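An equivalent way to do this random oversampling is `sklearn.utils.resample`, which also takes the `random_state` the question asks for (sketched on a toy frame; the real `hr_data_dummies` would be used instead):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for hr_data_dummies
df = pd.DataFrame({'Attrition': [0] * 8 + [1] * 2, 'x': range(10)})

minority = df[df['Attrition'] == 1]
majority = df[df['Attrition'] == 0]

# Sample the minority class with replacement up to the majority count
minority_over = resample(minority, replace=True,
                         n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_over])

print(len(balanced))  # → 16, eight rows per class
```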

2.3 Split the data into train/test set with a ratio 80/20. Keep a random_state equal to 0. Use the algorithm chosen in 1.2.12 and report accuracy, precision, recall, f1-score and AUC. (0.4 points)

In [54]:
X = df_test_over.drop(columns = ['Attrition'])
y = df_test_over.Attrition
In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model_rf = RandomForestClassifier(random_state = 0)
model_rf.fit(X_train, y_train)
pred_rf = model_rf.predict(X_test)

print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_test, pred_rf)))
print('Recall of Random Forest classifier on test set: {:.2f}'.format(metrics.recall_score(y_test, pred_rf)))
print('Precision of Random Forest classifier on test set: {:.2f}'.format(metrics.precision_score(y_test, pred_rf)))
print('F1-Score of Random Forest classifier on test set: {:.2f}'.format(metrics.f1_score(y_test, pred_rf)))
print('AUC Score of Random Forest classifier on test set: {:.2f}'.format(metrics.roc_auc_score(y_test, pred_rf)))
print('Random Forest - Confusion Matrix \n', metrics.confusion_matrix(y_test, pred_rf))
Accuracy of Random Forest classifier on test set: 0.97
Recall of Random Forest classifier on test set: 1.00
Precision of Random Forest classifier on test set: 0.94
F1-Score of Random Forest classifier on test set: 0.97
AUC Score of Random Forest classifier on test set: 0.97
Random Forest - Confusion Matrix 
 [[234  16]
 [  0 244]]
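The perfect recall above should be read with care: because oversampling was applied before the train/test split, duplicated minority rows can appear in both sets, so the test score is likely inflated. A safer pattern (sketched on toy data; column names are made up) is to split first and oversample only the training portion:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data standing in for the attrition features/target
df = pd.DataFrame({'x': range(20), 'y': [0] * 15 + [1] * 5})

# 1) Split first, so the test set is never touched by resampling
train, test = train_test_split(df, test_size=0.2, random_state=0,
                               stratify=df['y'])

# 2) Oversample the minority class inside the training split only
minority = train[train['y'] == 1]
majority = train[train['y'] == 0]
minority_over = resample(minority, replace=True,
                         n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_over])

# No oversampled training row leaks into the test set
print(sorted(set(train_bal['x']) & set(test['x'])))  # → []
print(train_bal['y'].value_counts().to_dict())
```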

2.4 Apply undersampling to the data while keeping random_state equal to 0. (0.4 points)

In [56]:
colors = [ '#66CDAA',  '#6495ED']

# random_state=0 as required by the question
df_class_0_under = df_class_0.sample(count_class_1, random_state=0)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print('Random under-sampling:')
print(df_test_under.Attrition.value_counts())

df_test_under.Attrition.value_counts().plot(kind='bar', title='Count (target)',color=colors);
Random under-sampling:
1    237
0    237
Name: Attrition, dtype: int64

2.5 Split the data into train/test set with a ratio 80/20. Keep a random_state equal to 0. Use the algorithm chosen in 1.2.12 to classify the test data and report accuracy, precision, recall, f1-score and AUC. (0.4 points)

In [57]:
X = df_test_under.drop(columns = ['Attrition'])
y = df_test_under.Attrition
In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model_rf = RandomForestClassifier(random_state = 0)
model_rf.fit(X_train, y_train)
pred_rf = model_rf.predict(X_test)

print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(metrics.accuracy_score(y_test, pred_rf)))
print('Recall of Random Forest classifier on test set: {:.2f}'.format(metrics.recall_score(y_test, pred_rf)))
print('Precision of Random Forest classifier on test set: {:.2f}'.format(metrics.precision_score(y_test, pred_rf)))
print('F1-Score of Random Forest classifier on test set: {:.2f}'.format(metrics.f1_score(y_test, pred_rf)))
print('AUC Score of Random Forest classifier on test set: {:.2f}'.format(metrics.roc_auc_score(y_test, pred_rf)))
print('Random Forest - Confusion Matrix \n', metrics.confusion_matrix(y_test, pred_rf))
Accuracy of Random Forest classifier on test set: 0.77
Recall of Random Forest classifier on test set: 0.84
Precision of Random Forest classifier on test set: 0.72
F1-Score of Random Forest classifier on test set: 0.78
AUC Score of Random Forest classifier on test set: 0.77
Random Forest - Confusion Matrix 
 [[35 15]
 [ 7 38]]

How long did it take you to solve the homework?

  • Please answer as precisely as you can. It does not affect your points or grade in any way. It is okay, if it took 0.5 hours or 24 hours. The collected information will be used to improve future homeworks.

**Answer:**

15 hours

What is the level of difficulty for this homework?

you can put only number between $0:10$ ($0:$ easy, $10:$ difficult)

**Answer:** 6

In [ ]: